computer vision model
From Visual Question Answering to multimodal learning: an interview with Aishwarya Agrawal
You were awarded an Honourable Mention for the 2019 AAAI / ACM SIGAI Doctoral Dissertation Award. What was the topic of your dissertation research, and what were the main contributions or findings? My PhD dissertation was on the topic of Visual Question Answering, called VQA. We proposed the task of open-ended and free-form VQA - a new way to benchmark computer vision models by asking them questions about images. We curated a large-scale dataset for researchers to train and test their models on this task.
- North America > Canada > Quebec > Montreal (0.04)
- Asia > India > Gujarat > Gandhinagar (0.04)
- Personal > Interview (0.65)
- Research Report > New Finding (0.47)
- Personal > Honors > Award (0.35)
Geoclidean: Few-Shot Generalization in Euclidean Geometry
Euclidean geometry is among the earliest forms of mathematical thinking. While the geometric primitives underlying its constructions, such as perfect lines and circles, do not often occur in the natural world, humans rarely struggle to perceive and reason with them. Will computer vision models trained on natural images show the same sensitivity to Euclidean geometry? Here we explore this question by studying few-shot generalization in the universe of Euclidean geometry constructions. We introduce Geoclidean, a domain-specific language for Euclidean geometry, and use it to generate two datasets of geometric concept learning tasks for benchmarking generalization judgements of humans and machines. We find that humans are indeed sensitive to Euclidean geometry and generalize strongly from a few visual examples of a geometric concept. In contrast, low-level and high-level visual features from standard computer vision models pretrained on natural images do not support correct generalization. Thus Geoclidean represents a novel few-shot generalization benchmark for geometric concept learning, where the performance of humans and that of AI models diverge. The Geoclidean framework and dataset are publicly available for download.
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
The rapid development of large language and vision models (LLVMs) has been driven by advances in visual instruction tuning. Recently, open-source LLVMs have curated high-quality visual instruction tuning datasets and utilized additional vision encoders or multiple computer vision models in order to narrow the performance gap with powerful closed-source LLVMs. These advancements are attributed to multifaceted information required for diverse capabilities, including fundamental image understanding, real-world knowledge about common-sense and non-object concepts (e.g., charts, diagrams, symbols, signs, and math problems), and step-by-step procedures for solving complex questions. Drawing from the multifaceted information, we present a new efficient LLVM, Mamba-based traversal of rationales (Meteor), which leverages multifaceted rationale to enhance understanding and answering capabilities. To embed lengthy rationales containing abundant information, we employ the Mamba architecture, capable of processing sequential data with linear time complexity. We introduce a new concept of traversal of rationale that facilitates efficient embedding of rationale. Subsequently, the backbone multimodal language model (MLM) is trained to generate answers with the aid of rationale. Through these steps, Meteor achieves significant improvements in vision language performances across multiple evaluation benchmarks requiring diverse capabilities, without scaling up the model size or employing additional vision encoders and computer vision models.
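The linear-time claim for sequence processing can be illustrated with a minimal sketch of a scalar linear state-space recurrence. This is not the Meteor or Mamba implementation: real Mamba layers use vector-valued states and input-dependent (selective) parameters, both omitted here. The function name `ssm_scan` and the parameters `a`, `b`, `c` are hypothetical and chosen only for illustration.

```python
def ssm_scan(xs, a, b, c):
    """Run a scalar linear recurrence over a sequence in one pass.

    h_t = a * h_{t-1} + b * x_t   (state update)
    y_t = c * h_t                 (readout)

    Each step does constant work, so the total cost grows linearly
    with the sequence length. The parameters a, b, c are fixed
    scalars here (an assumption made for simplicity).
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


if __name__ == "__main__":
    # An impulse at the first position decays geometrically through the state.
    print(ssm_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=1.0))
```

The single loop carries a fixed-size state forward, which is why a long rationale can be consumed without the quadratic cost of pairwise attention.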
Adversarial Examples that Fool both Computer Vision and Time-Limited Humans
Machine learning models are vulnerable to adversarial examples: small changes to images can cause computer vision models to make mistakes such as identifying a school bus as an ostrich. However, it is still an open question whether humans are prone to similar mistakes. Here, we address this question by leveraging recent techniques that transfer adversarial examples from computer vision models with known parameters and architecture to other models with unknown parameters and architecture, and by matching the initial processing of the human visual system. We find that adversarial examples that strongly transfer across computer vision models influence the classifications made by time-limited human observers.
- North America > United States > Virginia (0.04)
- North America > United States > Pennsylvania (0.04)
- North America > United States > Maryland (0.04)
- (6 more...)
- Information Technology > Security & Privacy (0.94)
- Health & Medicine > Therapeutic Area > Neurology (0.48)
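The idea that a small input change can flip a model's decision can be sketched on a toy linear classifier. This is an illustrative stand-in, not the procedure used in the paper: it takes one bounded step along the sign of the gradient, and for a linear score the gradient with respect to the input is just the weight vector. The weights, input, and the budget `eps` below are made-up values.

```python
def score(w, b, x):
    """Linear score; the predicted class is 1 when the score is positive."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    return 1 if score(w, b, x) > 0 else 0


def sign(v):
    return (v > 0) - (v < 0)


def perturb(w, x, label, eps):
    """Move each input coordinate by at most eps against the current label.

    For a linear model d(score)/dx = w, so stepping along -sign(w)
    lowers the score (to leave class 1) and +sign(w) raises it.
    """
    direction = -1 if label == 1 else 1
    return [xi + direction * eps * sign(wi) for xi, wi in zip(x, w)]


# Hypothetical toy values, chosen only to make the flip visible.
w = [0.5, -0.25, 0.75, -0.5]
b = 0.0
x = [0.2, 0.1, 0.1, 0.1]
eps = 0.1

label = predict(w, b, x)
x_adv = perturb(w, x, label, eps)

if __name__ == "__main__":
    print("clean prediction:", label)
    print("perturbed prediction:", predict(w, b, x_adv))
```

No coordinate moves by more than `eps`, yet the score shifts by `eps` times the sum of the absolute weights, which is enough here to cross the decision boundary.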
Optuna vs Code Llama: Are LLMs a New Paradigm for Hyperparameter Tuning?
Kochnev, Roman, Goodarzi, Arash Torabi, Bentyn, Zofia Antonina, Ignatov, Dmitry, Timofte, Radu
Optimal hyperparameter selection is critical for maximizing the performance of neural networks in computer vision, particularly as architectures become more complex. This work explores the use of large language models (LLMs) for hyperparameter optimization by fine-tuning a parameter-efficient version of Code Llama using LoRA. The resulting model produces accurate and computationally efficient hyperparameter recommendations across a wide range of vision architectures. Unlike traditional methods such as Optuna, which rely on resource-intensive trial-and-error procedures, our approach achieves competitive or superior Root Mean Square Error (RMSE) while substantially reducing computational overhead. Importantly, the models evaluated span image-centric tasks such as classification, detection, and segmentation, fundamental components in many image manipulation pipelines including enhancement, restoration, and style transfer. Our results demonstrate that LLM-based optimization not only rivals established Bayesian methods like Tree-structured Parzen Estimators (TPE), but also accelerates tuning for real-world applications requiring perceptual quality and low-latency processing.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- (4 more...)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
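For reference, the RMSE metric named in the abstract can be computed as in the short sketch below. The abstract does not say exactly which quantities are compared, so the inputs are treated here simply as paired predicted and reference values; the function name `rmse` is chosen for illustration.

```python
import math


def rmse(predicted, reference):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    if not predicted:
        raise ValueError("sequences must not be empty")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Errors of 3 and 4 give sqrt((9 + 16) / 2).
    print(rmse([0.0, 0.0], [3.0, 4.0]))
```

Lower values mean the predictions sit closer to the reference, and squaring penalizes large misses more heavily than small ones.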
SynGen-Vision: Synthetic Data Generation for training industrial vision models
Dubey, Alpana, Kuriakose, Suma Mani, Bhardwaj, Nitish
We propose an approach to generate synthetic data to train computer vision (CV) models for industrial wear and tear detection. Wear and tear detection is an important CV problem for predictive maintenance tasks in any industry. However, data curation for training such models is expensive and time-consuming due to the unavailability of datasets for different wear and tear scenarios. Our approach employs a vision language model along with a 3D simulation and rendering engine to generate synthetic data for varying rust conditions. We evaluate our approach by training a CV model for rust detection on the generated dataset and testing the trained model on real images of rusted industrial objects. The model trained with the synthetic data generated by our approach outperforms the other approaches with a mAP50 score of 0.87. The approach is customizable and can be easily extended to other industrial wear and tear detection scenarios.
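The "50" in mAP50 refers to an intersection-over-union (IoU) threshold of 0.5 for counting a predicted box as a correct detection. The sketch below shows only that matching criterion, not the full mean average precision computation or the authors' evaluation code. Boxes are assumed to be `(x1, y1, x2, y2)` corner tuples, and the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Overlap rectangle; width or height is zero when the boxes do not touch.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, true_box, threshold=0.5):
    """A prediction counts as a hit when its IoU reaches the threshold."""
    return iou(pred_box, true_box) >= threshold


if __name__ == "__main__":
    print(iou((0, 0, 10, 10), (0, 0, 10, 8)))       # overlap 80 of union 100
    print(is_match((0, 0, 10, 10), (0, 0, 10, 8)))
```

Average precision is then computed per class from the ranked hits and misses and averaged across classes to give the reported score.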